
Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 12, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

this is a challenging scenario to debug; please see the related issue.

Does this PR introduce a user-facing change?

The ClusterAPI provider will not scale down a MachineDeployment that is undergoing an update.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area labels Sep 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elmiko
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation area/provider/aws Issues or PRs related to aws provider area/provider/azure Issues or PRs related to azure provider area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/coreweave and removed do-not-merge/needs-area labels Sep 12, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/provider/digitalocean Issues or PRs related to digitalocean provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler area/provider/externalgrpc Issues or PRs related to the External gRPC provider area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider area/provider/huaweicloud area/provider/ionoscloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/linode Issues or PRs related to linode provider area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler area/provider/oci Issues or PRs related to oci provider area/provider/rancher area/provider/utho Issues or PRs related to Utho provider labels Sep 12, 2025
@elmiko
Contributor Author

elmiko commented Sep 12, 2025

i'm still working on some clusterapi-specific unit tests, but they are quite challenging given the mocks that are needed.

@elmiko
Contributor Author

elmiko commented Sep 16, 2025

+1 to what @sbueringer is saying; also, this problem is currently confined to the clusterapi provider, but it could affect any provider that does node updating in a similar fashion to clusterapi. i think we need to make the autoscaler smarter in these scenarios, where a cloud provider needs to have more control over which nodes are being marked for removal during a maintenance window.

@elmiko elmiko force-pushed the add-good-candidate-node-interface branch from cae87c7 to 150cf78 Compare September 16, 2025 19:16
@elmiko
Contributor Author

elmiko commented Sep 16, 2025

updated with @sbueringer's suggestions.

@sbueringer
Member

Answered above

This function allows cloud providers to specify when a node is not a
good candidate for scaling down. This will occur before the autoscaler has
begun to cordon, drain, and taint any node for scale down.

Also adds a unit test for the prefiltering node processor.
The initial implementation of this function for clusterapi will return
that a node is not a good candidate for scale down when it belongs to a
MachineDeployment that is currently rolling out an upgrade.
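For context, here is a minimal sketch of the hook the commit message describes: a way for a provider to veto a node before the autoscaler begins to cordon, drain, and taint it. The interface name and exact placement are illustrative, not the PR's final code.

```go
package cloudprovider

import apiv1 "k8s.io/api/core/v1"

// ScaleDownCandidateChecker is an illustrative stand-in for the proposed
// addition to the cloud provider API (the PR itself extends the existing
// CloudProvider interface).
type ScaleDownCandidateChecker interface {
	// IsNodeCandidateForScaleDown returns false when the provider does not
	// want this node considered for scale down, e.g. because it belongs to a
	// MachineDeployment that is currently rolling out an upgrade.
	IsNodeCandidateForScaleDown(node *apiv1.Node) (bool, error)
}
```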
@elmiko elmiko force-pushed the add-good-candidate-node-interface branch from 150cf78 to a81913f Compare September 17, 2025 14:48
@elmiko
Contributor Author

elmiko commented Sep 17, 2025

updated again with the change i missed and the correction to the logic.

@sbueringer
Member

Prod code changes lgtm from my side

@aleksandra-malinowska
Contributor

aleksandra-malinowska commented Sep 17, 2025

No, this is about preventing cluster autoscaler from deleting / scaling down a Node altogether if Cluster API is doing a rollout.

Adding this annotation to a node always prevents CA from removing it. How is this insufficient?

The "if Cluster API is doing a rollout" part would have to be handled on the Cluster API side, of course. Since it's already cordoning and draining the node, it seems reasonable to assume it could add an annotation, too.

@sbueringer
Member

sbueringer commented Sep 17, 2025

The "if Cluster API is doing a rollout" part would have to be handled on the Cluster API side, of course. Since it's already cordoning and draining the node, it seems reasonable to assume it could add an annotation, too.

At the point where the issue occurs, Cluster API is not even aware that the autoscaler is cordoning/tainting/draining the Node

cluster-autoscaler will cordon/taint/drain the node and then eventually, at the end, in RemoveNodes tell Cluster API to delete the Node/Machine. Up until that point Cluster API has no idea what the autoscaler has been doing with the Node. Until then it assumes this Node/Machine is perfectly fine and usable.

@elmiko elmiko changed the title WIP Add good candidate node interface Add good candidate node interface Sep 17, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2025
@elmiko
Contributor Author

elmiko commented Sep 17, 2025

i think @sbueringer summed it up nicely: we need a way for the cloud provider to influence whether, and how, the autoscaler performs the steps prior to reducing the size of a node group.

i think it is also worth noting that for cloud providers that may create more nodes than their node group size indicates, we need an automatic way for the provider to tell the autoscaler that a node should be excluded from being considered for scale down. although adding the annotation might help in some scenarios, i think we need a mechanism the provider can exercise that does not require an update to a resource object or the api server.

@elmiko elmiko changed the title Add good candidate node interface Add IsNodeCandidateForScaleDown cloud provider interface Sep 17, 2025
@aleksandra-malinowska
Contributor

At the point where the issue occurs, Cluster API is not even aware that the autoscaler is cordoning/tainting/draining the Node

Thanks for clarifying that the draining referred to in #8494 is part of the scale-down process, not the upgrade process.

From the chart in #8494 I assumed Cluster API was also draining nodes before deleting the underlying machine as part of the upgrade (hence "CA observes (...) old nodes that have been cordoned/drained").

The problem is that if autoscaler tries to delete / scale down a Node during a rollout there's a high chance that it will end up deleting the wrong Node (and that then repeats until we have no Nodes anymore for a node group)

+1 to what @sbueringer is saying; also, this problem is currently confined to the clusterapi provider, but it could affect any provider that does node updating in a similar fashion to clusterapi. i think we need to make the autoscaler smarter in these scenarios, where a cloud provider needs to have more control over which nodes are being marked for removal during a maintenance window.

CA's job in a scale-down scenario is "choose which node will be deleted and make it happen". If something else overrides CA's choice and chooses which node will be removed, then CA can't do its job.

I don't think there's any way to make CA smarter as in "allow it to do its job even when an external API is actively preventing it". Quite possibly the best it can do is to stop trying (disable scale-down) when it believes an external API will override its decisions anyway.

This is exactly what this PR does, and to be clear, I don't object to this mitigation. My only objection is adding a new method, and one very specific to this scenario, to the CloudProvider interface. There are other ways to achieve the same effect, and I'd like to explore them before growing that interface.

If adding the annotation isn't possible because Cluster API doesn't drain nodes before deleting machines (does it?), then perhaps this can be handled at a node group level? Does a MachineDeployment or a MachineSet correspond to a NodeGroup object? If yes, the option to disable scale-down can probably be added to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/config/autoscaling_options.go#L39.
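A rough sketch of that node-group-option direction follows, assuming a hypothetical ScaleDownDisabled per-group option (the linked options struct has no such field today, and the rollout helper below is invented):

```go
package clusterapi

import "time"

// nodeGroupOptions mirrors the shape of the autoscaler's per-node-group
// options with one extra, hypothetical field; nothing here is real code.
type nodeGroupOptions struct {
	ScaleDownUnneededTime time.Duration // existing-style per-group option
	ScaleDownDisabled     bool          // hypothetical: skip this whole group during scale down
}

// nodegroup stands in for the provider's node group type.
type nodegroup struct{}

// machineDeploymentIsRollingOut is a hypothetical rollout check.
func (ng *nodegroup) machineDeploymentIsRollingOut() (bool, error) { return false, nil }

// GetOptions shows where the provider could flip the flag: while the backing
// MachineDeployment is rolling out, scale down is disabled for the group.
func (ng *nodegroup) GetOptions(defaults nodeGroupOptions) (*nodeGroupOptions, error) {
	opts := defaults
	rollingOut, err := ng.machineDeploymentIsRollingOut()
	if err != nil {
		return nil, err
	}
	opts.ScaleDownDisabled = rollingOut
	return &opts, nil
}
```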

@sbueringer
Member

sbueringer commented Sep 18, 2025

If adding the annotation isn't possible because Cluster API doesn't drain nodes before deleting machines

CAPI does drain Nodes before it deletes Machines, but as mentioned above CAPI is not aware that cluster-autoscaler is starting to delete the Node. This works something like this:

  1. autoscaler gets scale down candidates
  2. autoscaler selects a Machine for scale down
  3. autoscaler: cordons, taints and drains the Node
  4. autoscaler calls RemoveNodes which means it triggers Machine deletion in CAPI (by scaling down the MD and marking a Machine so it gets prioritized for deletion)

Only after step 4 is CAPI aware that this Node is being deleted; before that, as far as CAPI is concerned, this Machine is a healthy Machine like every other.

The problem we are trying to address is that during step 4 the autoscaler makes assumptions about how it can delete a specific Machine ("by scaling down the MD and marking a Machine so it gets prioritized for deletion") that are not true during MachineDeployment (MD) rollouts. Because of that, we want to ensure that the autoscaler does not scale down a Machine while a MD is rolling out (including that it should not do steps 1-3).

The CAPI Machine deletion process looks like this (very short version): CAPI cordons and drains the Node, waits for attached volumes to detach, deletes the infrastructure machine, and finally deletes the Node object.

Another orthogonal problem, in our opinion, is that with CAPI the autoscaler should never do steps 1-3, as CAPI has its own logic for that.

But the current PR just tries to solve the issue during MD rollouts for now, as it can have catastrophic consequences (a MD being scaled down to 0 during a rollout).
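For illustration, this is the kind of rollout check that implies, assuming the MachineDeployment is read as unstructured (the field names follow the Cluster API MachineDeployment status; the function itself is a sketch, not this PR's code):

```go
package clusterapi

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// machineDeploymentIsRollingOut reports whether a MachineDeployment still has
// replicas that are not yet updated and available, i.e. a rollout in progress.
func machineDeploymentIsRollingOut(md *unstructured.Unstructured) (bool, error) {
	specReplicas, found, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil || !found {
		return false, err
	}
	updated, _, err := unstructured.NestedInt64(md.Object, "status", "updatedReplicas")
	if err != nil {
		return false, err
	}
	available, _, err := unstructured.NestedInt64(md.Object, "status", "availableReplicas")
	if err != nil {
		return false, err
	}
	// During a rollout some replicas are still old, or not yet available.
	return updated < specReplicas || available < specReplicas, nil
}
```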

CA's job in a scale-down scenario is "choose which node will be deleted and make it happen". If something else overrides CA's choice and chooses which node will be removed, then CA can't do its job.

We are attempting to tell CA: "this MD is going through a rollout, you should not scale down any Node/Machine of a MD during rollout".

This is exactly what this PR does, and to be clear, I don't object to this mitigation. My only objection is adding a new method, and one very specific to this scenario, to the CloudProvider interface. There are other ways to achieve the same effect, and I'd like to explore them before growing that interface.

I personally don't have a strong opinion on how it's implemented; I would only like Nodes of a MD that is rolling out to be kept out of the scale down candidates, so we can avoid the scale down entirely (including steps 1-3). I think today we can only block in step 4, which is way too late, as the Node is already cordoned/tainted/drained at this point.

Does a MachineDeployment or a MachineSet correspond to a NodeGroup object?

For the current case a MachineDeployment corresponds to a NodeGroup object

@aleksandra-malinowska
Contributor

Thanks for the detailed explanation, much appreciated!

I personally don't have a strong opinion on how it's implemented; I would only like Nodes of a MD that is rolling out to be kept out of the scale down candidates, so we can avoid the scale down entirely (including steps 1-3). I think today we can only block in step 4, which is way too late, as the Node is already cordoned/tainted/drained at this point.

For the current case a MachineDeployment corresponds to a NodeGroup object

This sounds like a node group option will work then. It can be checked at the same point as in the current PR.

Another orthogonal problem in our opinion is that with CAPI the autoscaler should never do steps 1.-3. as CAPI has it's own logic for that.

Does Cluster API ensure that all pods from the node it selects in (2) will be able to run elsewhere in the cluster? If it does, then I guess it could work, although at this point CAPI may just as well do (4), too.

If not, doing (4) without (2) would risk some workloads becoming unschedulable before triggering a scale-up, essentially just causing an unnecessary disruption. I believe the only scenario where it's guaranteed to work well is removing 1 empty node at a time, which is pretty limited. But as you said, this discussion is unrelated to this PR.

@sbueringer
Member

sbueringer commented Sep 18, 2025

This sounds like a node group option will work then. It can be checked at the same point as in the current PR.

I think you are right. The difference is that this only allows disabling scale down for an entire node group, not for single nodes. But our current use case is also only to disable it for an entire node group.

But as you said, this discussion is unrelated to this PR.

Yup, let's discuss this separately once we get to that

@elmiko
Contributor Author

elmiko commented Sep 18, 2025

we discussed this issue/pr at the sig office hours today; @towca suggested looking at the custom scale down processors as another possible option. i had looked at the scale down processors initially, but didn't arrive at a solution. i am going to investigate that path again, as i tend to agree with @aleksandra-malinowska's point about not adding complexity to the cloud provider interface if we don't need it.

@wjunott
Contributor

wjunott commented Sep 19, 2025

we discussed this issue/pr at the sig office hours today; @towca suggested looking at the custom scale down processors as another possible option. i had looked at the scale down processors initially, but didn't arrive at a solution. i am going to investigate that path again, as i tend to agree with @aleksandra-malinowska's point about not adding complexity to the cloud provider interface if we don't need it.

Since the discussion has come to this point, are we still considering the approach of putting the logic into the capi provider's NodeGroupForNode() implementation? From what I can see, not all callers of this api handle a nil result, and we could enhance them if nil is documented to mean 'the node should not be processed by cluster autoscaler'. @towca @elmiko
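To make the suggestion concrete, here is a rough sketch of that idea in the provider's NodeGroupForNode (the rollout helper is hypothetical, and the replies below explain why this path was not taken):

```go
package clusterapi

import (
	apiv1 "k8s.io/api/core/v1"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

type provider struct{}

// machineDeploymentIsRollingOutFor is a hypothetical helper that would inspect
// the node's owning MachineDeployment.
func (p *provider) machineDeploymentIsRollingOutFor(node *apiv1.Node) (bool, error) {
	return false, nil
}

// NodeGroupForNode returning nil signals "do not process this node", but that
// affects every caller that looks up the node's group, not only scale down.
func (p *provider) NodeGroupForNode(node *apiv1.Node) (cloudprovider.NodeGroup, error) {
	rollingOut, err := p.machineDeploymentIsRollingOutFor(node)
	if err != nil {
		return nil, err
	}
	if rollingOut {
		return nil, nil
	}
	// ... the normal node group lookup would go here ...
	return nil, nil
}
```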

@sbueringer
Member

sbueringer commented Sep 19, 2025

Since the discussion has come to this point, are we still considering the approach of putting the logic into the capi provider's NodeGroupForNode() implementation? From what I can see, not all callers of this api handle a nil result, and we could enhance them if nil is documented to mean 'the node should not be processed by cluster autoscaler'. @towca @elmiko

I think this still applies: #8531 (comment)

Still seems like the riskiest option to me. I'm not sure I follow your suggestion. Just to highlight it again, we specifically want to disable only scale down during a MD rollout (not scale up).

@elmiko
Contributor Author

elmiko commented Sep 19, 2025

Since the discussion has come to this point, are we still considering the approach of putting the logic into the capi provider's NodeGroupForNode() implementation? From what I can see, not all callers of this api handle a nil result, and we could enhance them if nil is documented to mean 'the node should not be processed by cluster autoscaler'. @towca @elmiko

no, we can't use NodeGroupForNode in this case. there is extra context related to the scale down process that cluster-api needs to understand.

@elmiko
Contributor Author

elmiko commented Sep 23, 2025

i have been investigating an alternative implementation that would allow cloud providers to inject custom processors into the core autoscaler configuration. so far, it seems that the processors change would also require at least a refactor of the NewCloudProvider interface to allow passing the processors.

i think the idea of allowing cloud providers to add custom processors is actually really powerful. i will sketch that out as well and bring it up for discussion this week at the sig meeting.
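As a rough sketch of the shape such a provider-supplied processor could take (the interface and type names below are simplified stand-ins, not the autoscaler's actual processor interfaces):

```go
package processors

import apiv1 "k8s.io/api/core/v1"

// scaleDownCandidateFilter stands in for a processor hook a cloud provider
// could register to trim the candidate list before cordon/taint/drain begins.
type scaleDownCandidateFilter interface {
	FilterCandidates(nodes []*apiv1.Node) []*apiv1.Node
}

// rolloutAwareFilter drops nodes whose owning MachineDeployment is rolling
// out, as reported by a provider-supplied callback.
type rolloutAwareFilter struct {
	isRollingOut func(node *apiv1.Node) bool
}

func (f *rolloutAwareFilter) FilterCandidates(nodes []*apiv1.Node) []*apiv1.Node {
	kept := make([]*apiv1.Node, 0, len(nodes))
	for _, n := range nodes {
		if f.isRollingOut(n) {
			continue // never cordon/taint/drain nodes of a rolling MachineDeployment
		}
		kept = append(kept, n)
	}
	return kept
}
```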

@elmiko
Contributor Author

elmiko commented Sep 26, 2025

we discussed this PR at the sig meeting yesterday. i am going to proceed with creating the alternate patch; i think it will give a better end result in terms of giving providers more functionality.

i will test the alternate solution to ensure it continues to solve the problem.

@elmiko
Contributor Author

elmiko commented Sep 26, 2025

the refactor is looking good and does not add complexity to the cloud provider interface, although it does modify how cloud providers are constructed.

i will post a PR once i finish some local testing.

@elmiko
Contributor Author

elmiko commented Sep 29, 2025

i have created #8583 as an alternate solution. it appears to solve the same issues in my local test environment and has the advantage of not introducing a new cloud provider interface function.

i think i prefer the 8583 solution, but i will leave both PRs open for a few days to gather comments. if there are no objections, i will most likely close this PR by the end of the week.

@elmiko
Contributor Author

elmiko commented Oct 2, 2025

i am closing this in favor of #8583. i think the other PR is less complex and gives us more options in the future.

@elmiko elmiko closed this Oct 2, 2025